Specifications and features for new FRONT for PARRY.
Some are already done in the old FRONT.
Is there any reason that the following can't be done in SAIL as easily as in MLISP?
BILL: Historical reasons:
    Lisp's arbitrary data structures
      (especially good for retaining a record of the transformations performed)
    Lisp's interpreter for debugging

***** PHILOSOPHY **********

Treat SYNONM, IRREG, IDIOM, SPATS, & CPATS in a uniform manner.
No more clear distinction between levels of processing
(input, re-spelled, canonized, pattern, λ####).
The following description sounds like a production system.

***** MATCHING ALGORITHM **********

Let's see what Terry W. has been thinking about along these lines.

Single look-up program finding the longest match. Try to start the match at the front of
the input, but sometimes skip over stuff to work on the middle. Skip when the longest
match in the table still has more stuff in it but the input differs.
Use the remainder of the pattern to choose among multiple meanings for the following
word in the input.
    e.g.  how's she doing
          HOW IS she doing
          how BE she doing
          how be THEY doing
          how be NURSES doing
          how be NURSE doing
          how be nurse DO
          λ1234
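
Rough sketch of that longest-match rewriting loop, in Python rather than MLISP or
SAIL. The table below just reproduces the example chain above; its entries, the
helper names, and the λ1234 rewrite are stand-ins, not the real FRONT data.

# Sketch only: a longest-match rewriter over a word list, written in Python
# rather than MLISP or SAIL.  Table entries and the lambda number are taken
# from the example above; the real table format is still to be decided.

TABLE = {
    ("HOW'S",): ("HOW", "IS"),                 # re-spelling
    ("HOW", "IS"): ("HOW", "BE"),              # canonize the auxiliary
    ("SHE",): ("THEY",),                       # pronoun meaning (context-dependent)
    ("THEY",): ("NURSES",),
    ("NURSES",): ("NURSE",),
    ("DOING",): ("DO",),                       # suffix removal
    ("HOW", "BE", "NURSE", "DO"): ("λ1234",),  # final pattern
}

def rewrite_once(words):
    """Find the longest table entry matching anywhere in the input,
    preferring matches nearer the front, and apply it once."""
    best = None
    for start in range(len(words)):
        for length in range(len(words) - start, 0, -1):
            key = tuple(words[start:start + length])
            if key in TABLE:
                if best is None or length > best[2]:
                    best = (start, key, length)
                break                          # longest match at this start
    if best is None:
        return None
    start, key, length = best
    return words[:start] + list(TABLE[key]) + words[start + length:]

def canonize(words):
    """Apply rewrites until nothing more matches, keeping every step."""
    steps = [list(words)]
    while True:
        nxt = rewrite_once(steps[-1])
        if nxt is None:
            return steps
        steps.append(nxt)

# canonize(["HOW'S", "SHE", "DOING"]) ends at ["λ1234"], with all the
# intermediate forms kept as the record of transformations performed.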

Look-up algorithm must provide for word dropping (& letter & segment ?).
A few meanings constantly change (i.e. "they", "bug"?).
Must call a function to compute the meaning or
update the stored data for lookup.

"Window" the internal workings.
Accept multiple sentences at once.
Have all intermediate steps available until after the response is given.
Keep a list of all transformations performed.
Allow additional information to be learned during the interview.
Might be hard to insert into permanent data storage, but it could be stuck on.

***** SUFFIXES & SPELLING **********

See DICTIO & PREFIX & SUFFIX on [PAR,RCP] for data.

Clean up characters on input.

Use an algorithm like the present FRONT's:
    1) Look up
    2) Suffix removal
    3) Re-spelling
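
Rough sketch of that three-step loop for one word, again in Python. DICTIONARY,
strip_suffix, and respell are hypothetical stand-ins for the real tables and
routines kept on [PAR,RCP].

# Sketch only, Python rather than MLISP/SAIL: the three-step loop applied to
# one word.  DICTIONARY, strip_suffix, and respell are hypothetical stand-ins
# for the real tables and routines.

DICTIONARY = {"NURSE", "GO", "HOW", "BE"}

def strip_suffix(word):
    """Return (stem, suffix) for a few common endings, else None."""
    for suffix in ("ING", "ED", "S"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)], suffix
    return None

def respell(word):
    """Placeholder for the spelling correction discussed below."""
    return None

def look_up(word):
    # 1) straight look-up
    if word in DICTIONARY:
        return word, None
    # 2) suffix removal, then look up the stem (the suffix is kept as information)
    stripped = strip_suffix(word)
    if stripped and stripped[0] in DICTIONARY:
        return stripped
    # 3) re-spelling, then start over on the corrected word
    corrected = respell(word)
    if corrected is not None:
        return look_up(corrected)
    return None          # unknown word

# look_up("NURSES") -> ("NURSE", "S");  look_up("GOING") -> ("GO", "ING")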

Make sure that the suffix is legal on that class of word,
and note what effect it has on the class (i.e. part-of-speech → part-of-speech).

Gain information from prefixes & suffixes.
Replace the suffix with part of speech, number, tense, etc.
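
One way the suffix-legality and suffix-effect information could be tabulated;
the entries below are illustrative guesses, not the actual DICTIO/SUFFIX data.

# Sketch only: a table saying which word classes a suffix is legal on and what
# class and feature result.  The entries are illustrative, not the real data.

SUFFIX_TABLE = {
    # suffix : { class it attaches to : (resulting class, feature) }
    "S":   {"NOUN": ("NOUN", "PLURAL"), "VERB": ("VERB", "3RD-SINGULAR")},
    "ED":  {"VERB": ("VERB", "PAST")},
    "ING": {"VERB": ("VERB", "PROGRESSIVE")},
    "LY":  {"ADJECTIVE": ("ADVERB", None)},
}

def apply_suffix(stem_class, suffix):
    """Return (new class, feature) if the suffix is legal on this class, else None."""
    return SUFFIX_TABLE.get(suffix, {}).get(stem_class)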

For spelling correction, see SPELL.REG[UP,DOC].
It does 1 extra, 1 missing, 1 wrong, 2 transposed.
I will do (sketched just after this list):
    1 extra
    1 missing - add "E" at end (for suffixer)
    1 wrong   - neighboring keys
              - "O" for "0"
              - "Y" for "I" at end (for suffixer)
    2 transposed
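
Sketch of a candidate generator for exactly those four kinds of corrections
(assumes upper-case input). NEIGHBORS is a tiny made-up piece of a
keyboard-adjacency table, and none of this is the SPELL.REG method itself.

# Sketch only: generate candidates of the four kinds listed above and keep
# the ones in the dictionary.  Assumes upper-case words; NEIGHBORS is a tiny
# made-up piece of a keyboard-adjacency table.

NEIGHBORS = {"A": "QWSZ", "O": "IP0LK", "0": "O9P", "I": "UOJK"}

def candidates(word):
    out = set()
    n = len(word)
    # 1 extra: delete each letter in turn
    for i in range(n):
        out.add(word[:i] + word[i + 1:])
    # 1 missing: only "E" added at the end (for the suffixer)
    out.add(word + "E")
    # 1 wrong: neighboring keys, "O" for "0", "Y" for "I" at the end
    for i, c in enumerate(word):
        for r in NEIGHBORS.get(c, ""):
            out.add(word[:i] + r + word[i + 1:])
    if word.endswith("I"):
        out.add(word[:-1] + "Y")
    # 2 transposed: swap each adjacent pair
    for i in range(n - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    out.discard(word)
    return out

def correct(word, dictionary):
    """Return the correction if exactly one candidate is in the dictionary (a set)."""
    hits = candidates(word) & dictionary
    return hits.pop() if len(hits) == 1 else None   # none or ambiguous -> failure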

Keep count of spelling corrections and failures.
Handle run-on words.
Treat prefixes & suffixes as run-on words?
Try to identify misspelling after suffix removal. How?

***** PARTS OF SPEECH **********

Higher level classification of words:
    NOUN for (DOES x GET ALONG WITH y)
    VERB
    PRONOUN for reflexives, at least
    ADJECTIVE for (I BE x)
Have a table (matrix) giving a λ#### for specific values of x & y.
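
Sketch of such a matrix keyed on (x, y); the pairs and λ numbers below are
invented placeholders, not actual PARRY patterns.

# Sketch only: a matrix giving a lambda number for specific values of x and y
# in a pattern like (DOES x GET ALONG WITH y).  The pairs and lambda numbers
# are invented placeholders, not actual PARRY data.

GET_ALONG = {
    ("YOU", "NURSE"):   "λ2041",
    ("YOU", "DOCTOR"):  "λ2042",
    ("NURSE", "YOU"):   "λ2043",
}

def get_along(x, y):
    # fall back to a general pattern when the specific pair isn't tabulated
    return GET_ALONG.get((x, y), "λ2000")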

Retain HAVE, BE, & DO and change all other verbs to "VERB".
Change all nouns to "NOUN".
Change all adjectives to "ADJECTIVE".
Is it possible to distinguish NOUN-VERB-ADJECTIVE?

***** PHRASES **********

For disambiguation (= idioms) see Stone's papers.

Want to temporarily replace a verb with "verb", remove the auxiliaries, then
bring back the specific verb.

Use a start-of-sentence marker to require a noun in front of some patterns.

Verb phrase eliminator: to turn all of:
    I GO
    I WENT
    I DID GO                into:  I GO .
    I WAS GOING
    I HAVE GONE
    I HAVE BEEN GOING

Also interrogatives:
    DID I GO
    HAVE I GONE             into:  I GO ?
    WAS I GOING
    HAVE I BEEN GOING

Also passives:
    I AM TAKEN
    I AM BEING TAKEN
    I WAS TAKEN             into:  x TAKE I .
    I HAVE BEEN TAKEN
    I WAS BEING TAKEN

Also passive interrogatives:
    AM I TAKEN
    AM I BEING TAKEN
    WAS I TAKEN             into:  x TAKE I ?
    HAVE I BEEN TAKEN
    WAS I BEING TAKEN
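
Sketch of an eliminator that covers just the example sentences above; the
auxiliary and participle tables are small illustrative subsets, and the flags
anticipate the note below on recording what was removed.

# Sketch only: a verb-phrase eliminator covering just the example sentences
# above, in Python.  The auxiliary and participle tables are small
# illustrative subsets, and "x" marks the unknown agent of a passive.

BE_FORMS = {"AM", "IS", "ARE", "WAS", "WERE", "BE", "BEEN", "BEING"}
AUXILIARIES = BE_FORMS | {"DO", "DID", "DOES", "HAVE", "HAS", "HAD"}
STEM = {"GO": "GO", "WENT": "GO", "GOING": "GO", "GONE": "GO",
        "TAKE": "TAKE", "TOOK": "TAKE", "TAKING": "TAKE", "TAKEN": "TAKE"}
PAST_PARTICIPLE = {"GONE", "TAKEN"}

def eliminate(words):
    """Return the reduced pattern plus flags recording what was removed."""
    flags = set()
    if words[0] in AUXILIARIES:
        flags.add("QUESTION")                       # DID I GO, AM I TAKEN, ...
    content = [w for w in words if w not in AUXILIARIES]
    subject, verb = content[0], content[-1]
    if verb in PAST_PARTICIPLE and any(w in BE_FORMS for w in words):
        flags.add("PASSIVE")                        # I WAS TAKEN -> x TAKE I
        return ["x", STEM[verb], subject], flags
    return [subject, STEM[verb]], flags

# eliminate("I HAVE BEEN GOING".split()) -> (["I", "GO"], set())
# eliminate("WAS I BEING TAKEN".split()) -> (["x", "TAKE", "I"], {"QUESTION", "PASSIVE"})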

Modal verbs might also be removed:
    WILL
    CAN
    SHOULD
    OUGHT to
    NEED to
    WANT to
    BE ABLE to
    TRY to

What are:
    SEEM to be
    LIKE to

Have a pre-determined set of flags indicating what was removed.
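
Sketch of modal removal that records what was removed as flags; the modal list
is the one above, and the function name is made up.

# Sketch only: strip modal constructions and record what was removed as a
# flag, so the response side can still see it.  Longer modals listed first.

MODALS = [("BE", "ABLE", "TO"), ("OUGHT", "TO"), ("NEED", "TO"), ("WANT", "TO"),
          ("TRY", "TO"), ("WILL",), ("CAN",), ("SHOULD",)]

def strip_modals(words):
    flags = []
    out = []
    i = 0
    while i < len(words):
        for modal in MODALS:
            if tuple(words[i:i + len(modal)]) == modal:
                flags.append(" ".join(modal))   # e.g. record that "WANT TO" was removed
                i += len(modal)
                break
        else:
            out.append(words[i])
            i += 1
    return out, flags

# strip_modals("I WANT TO GO".split()) -> (["I", "GO"], ["WANT TO"])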

How will I gobble up a "noun phrase" in the input?

***** OTHER **********

Represent both "start of sentence" and "end of sentence" in patterns.
What about "not"?
Learn about unknown words from context.
Patterns could specify restrictions on part of speech.
Picking up the Doctor's name is an instance of this?

Ellipsis problems. Try a partial match with the previous sentence(s) to discover
the missing parts.
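
One cheap way such a partial match might work, sketched with an invented toy
lexicon; the real version would presumably match at the pattern level rather
than on raw words.

# Sketch only: expand an elliptical reply by splicing the fragment into the
# slot of the previous sentence whose words share its part of speech.
# CLASS is a toy lexicon invented for the example.

CLASS = {"FATHER": "NOUN", "MOTHER": "NOUN", "HATE": "VERB", "LIKE": "VERB",
         "YOU": "PRONOUN", "I": "PRONOUN", "MY": "DET", "YOUR": "DET"}

def expand_ellipsis(fragment, previous):
    """Substitute the fragment for the stretch of the previous sentence
    that has the same sequence of word classes."""
    frag_classes = [CLASS.get(w) for w in fragment]
    n = len(fragment)
    for i in range(len(previous) - n + 1):
        if [CLASS.get(w) for w in previous[i:i + n]] == frag_classes:
            return previous[:i] + fragment + previous[i + n:]
    return fragment                      # no match found; give up

# previous = "DO YOU HATE YOUR FATHER".split()
# expand_ellipsis("MY MOTHER".split(), previous)
#   -> ["DO", "YOU", "HATE", "MY", "MOTHER"]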

KEN: CAN PARRY PATTERN MATCH HIS OWN OUTPUT?
Only if the Dr. might also say it.
HOW MANY AMBIGUITIES IS HE STILL MISSING?
When they arise, patterns are included to distinguish important meanings.

Write a prototype program. Make sure it can do everything FRONT does?!
BILL: Don't need to do everything FRONT does. Only the common, easy ones.
***** DATA STORAGE FORMAT *********

All jumbled together in one huge table? Wastes space (or search time).
Fixed (maximum) size entries (and values) use too much room. This can be
avoided by separating the tables by size, but then getting the longest match
requires searching the tables for all sizes (starting at the biggest) until found.
Would a compromise get the best of both or the worst of both? For example,
just a few (= 2 or 3) tables for different sizes of data.
Arbitrary-size entries (and values) require finding the special marker characters
which separate the entries. This makes binary search harder and a lot slower.
01100
01200 TWO IDEAS FOR STORING THE FRONT END'S TABLES:
01300
01400 ***** 1) SIMPLE AND LOTS OF DISK READS **********
01500
01600 The tables are stored on disk. Assuming each entry is composed of 36-bit words
01700 of 5 ascii chars each, store each entry packed in right next to the rest.
01800 The first word of each is either a number telling how many words the entry is,
01900 or a special word indicating a break between entries.
02000
02100 In core is a table listing:
02200 1) the first entry on a disk record,
02300 2) the address on the disk of the disk record.
02400
02500 The lookup consists of a binary search thru the table for the proper
02600 disk record, reading in the disk record in dump mode, and searching
02700 linearly the disk record for the proper entry (linear search).
02800
02900 If the table in core is too long, it can be cut in half by reading in 2 records
03000 from disk and having the entries to every other record in the table.
03100 And the table can be halved again...
03200
03300 This method takes one disk read per entry lookup, assuming the entire
03400 index table is in core. This may be a problem, because each entry
03500 in the table is an arbitrary length, although realistically it may
03600 be possible to distinguish between two entries by the first five words.
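
Sketch of scheme 1's lookup in Python; the index contents and read_record
stand in for the real in-core table, the dump-mode disk read, and the 36-bit
word format.

# Sketch only, in Python: the in-core index of scheme 1.  A "disk record"
# here is just a list of (entry, value) pairs; values are invented.

import bisect

# in-core table: (first entry on the record, disk address), sorted by entry
INDEX = [("ANGRY", 0), ("HOSPITAL", 1), ("NURSE", 2)]

def read_record(address):
    """Stand-in for reading one disk record in dump mode."""
    FAKE_DISK = {
        0: [("ANGRY", "λ300"), ("BUG", "λ301")],
        1: [("HOSPITAL", "λ310"), ("MAFIA", "λ311")],
        2: [("NURSE", "λ320"), ("SHOCK", "λ321")],
    }
    return FAKE_DISK[address]

def look_up(entry):
    # binary search the index for the last record whose first entry <= entry
    keys = [k for k, _ in INDEX]
    i = bisect.bisect_right(keys, entry) - 1
    if i < 0:
        return None
    _, address = INDEX[i]
    # one disk read, then a linear scan of that record
    for key, value in read_record(address):
        if key == entry:
            return value
    return None

# look_up("MAFIA") -> "λ311": one index probe, one disk read, one linear scan.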


***** 2) CLEVER AND SPACE-SAVING AND A PROGRAMMING CHALLENGE **********

This looks more like a transition-net implementation.

Everything is in core.

The first word of each entry is stored in an array. The array entries consist
of this first word of ASCII and a pointer to the array of second ASCII words
which can follow this particular first one, etc. The final pointer is flagged
as a pointer to the λ or whatever.

The lookup consists of binary-searching the top-level array on the first ASCII
word to be matched. The pointer to the next-level table is followed, and again
a binary search, etc.
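
Sketch of scheme 2's nested arrays with a binary search per level; the entries
and λ value come from the earlier "how be nurse do" example, and the FINAL flag
plays the part of the flagged final pointer.

# Sketch only: the nested-array scheme with a binary search per level.  Each
# level is a sorted list of (word, next) pairs; "next" is either the next
# level or, when flagged FINAL, the lambda value.

import bisect

FINAL = "FINAL-POINTER"            # flag marking a pointer to the λ

TRIE = [
    ("HOW", [
        ("BE", [
            ("NURSE", [
                ("DO", (FINAL, "λ1234")),
            ]),
        ]),
    ]),
]

def look_up(words):
    level = TRIE
    for word in words:
        keys = [k for k, _ in level]
        i = bisect.bisect_left(keys, word)       # binary search this level
        if i == len(keys) or keys[i] != word:
            return None                          # no continuation here
        level = level[i][1]
        if isinstance(level, tuple) and level[0] == FINAL:
            return level[1]                      # the flagged final pointer
    return None                                  # input ran out before a λ

# look_up(["HOW", "BE", "NURSE", "DO"]) -> "λ1234"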

This method is the most compact that REM and I could think of. The access
might be faster than in the previous one. One problem is the programming. The
other is setting up the data structure. It is possible, but a fascinating
task in itself.